Statements
Executive Summary
1. Introduction
2. Aim and Methodology of this Project
3. Exploratory Data Analysis - EDA
4. Machine Learning Model
5. Results & Discussion
6. Conclusion
Acknowledgement
I sincerely thank the OpenGeoHub Foundation for their work in making
open-source geospatial models available to everyone. The tutorial videos
posted on their website helped me learn remote sensing concepts and
understand the workflow of geospatial data analysis. Particular mention
goes to the tutorial “Machine Learning for Spatial Data” by
Dr. Hanna Meyer, which guided my work on this project.
Furthermore, I thank the professional certification courses on programming, machine learning, and remote sensing concepts, which helped me to work on this project.
Use of generative artificial intelligence
Generative artificial intelligence (GenAI) was mainly used for creating
charts and adjusting visualization parameters in R. GenAI was also used
for code debugging. However, the responses provided by GenAI were
critically reviewed before being implemented.
Problem Statement
Gorse (Ulex europaeus) was brought to
New Zealand during European settlement as a hedge plant and has since
spread into farmland, with negative consequences for the
quality of the grassland. Each year, millions of dollars are spent on
its control, and field sampling for developing management
strategies is expensive. It is therefore of high importance to map the
distribution of such invasive plants using remote sensing methods.
Project Aim and Focus
The project focuses on using machine learning models to predict
the invasive species gorse (Ulex europaeus) on
the Banks Peninsula of New Zealand.
Raw data used
Sentinel satellite data and plotted land cover class of a section of the
Banks Peninsula of New Zealand.
Methodology
The following methodology was undertaken for this project:
- Raw data and land cover classes are visualized and merged into a single
data set used to build the machine learning models
(Random Forest, XGBoost, and
Support Vector Machine).
- The merged data set is split into 30% training and 70% test data, which
are used to train the models and evaluate their predictions.
- Various analyses, such as the confusion matrix, scoring metrics, prediction
& entropy maps, and uncertainty diagnostics, were performed to
assess the models’ performance in land cover classification and in
predicting the invasive plant species.
Results
Out of the three models, RF and XGBoost
perform very well, while the SVM model is not up to par in
classifying the land cover classes from the satellite images. Based on
the various analyses, Random Forest, with the highest
accuracy of 99.33%, outperforms XGBoost in predicting
the invasive species gorse (Ulex europaeus),
using Sentinel satellite data, on the Banks Peninsula of New Zealand.
This project demonstrates land cover classification in R using
machine learning algorithms such as Random Forest,
XGBoost, and SVM, and compares their
performance. It is adapted from the tutorial presented by Dr. Hanna
Meyer in July 2018, extended to include additional machine
learning models for comparison.
The focus is on land cover classification using Sentinel satellite data, specifically targeting the invasive species gorse (Ulex europaeus) on the Banks Peninsula. This plant was brought to New Zealand during European settlement as a hedge plant and has since spread into farmland, with negative consequences for the quality of the grassland.
Each year, millions of dollars are spent on its control. To develop management strategies, it is of high importance to map the distribution of such invasive plants. Since field sampling is very expensive, remote sensing methods are required for spatially explicit monitoring. In this project, Sentinel satellite imagery from 2017 was used to map the current land cover distribution on a section of the Banks Peninsula.
The project aim is to predict the invasive species gorse (Ulex europaeus) on the Banks Peninsula, using machine learning algorithms such as Random Forest, XGBoost, and SVM.
The following methodology was undertaken for this project:
- Raw data and land cover classes are visualized and merged into a
single data set used to build the machine learning models
(Random Forest, XGBoost, and
Support Vector Machine).
- The merged data set is split into 30% training and 70% test data, which are used to train the models and evaluate their predictions.
- Various analyses, such as the confusion matrix, scoring metrics, prediction & entropy maps, and uncertainty diagnostics, were performed to assess the models’ performance in land cover classification and in predicting the invasive plant species.
This project shows the process of land cover (LC) classification using machine learning techniques in R. The focus is on classifying land cover types, particularly the invasive species gorse (Ulex europaeus), using Sentinel satellite data.
First, the libraries and packages needed for this LC classification project are loaded. The selected libraries provide functions for handling raster data, reading shapefiles, performing machine learning tasks, and visualizing results.
knitr::opts_chunk$set(echo = TRUE)
# Spatial and Raster data handling packages
library(raster) # Handling Raster data
library(sf) # Handling Vector data
library(terra) # Optimise large raster datasets
library(mapview) # Quick raster/vector visualization
# Data Manipulation & Reshaping
library(dplyr) # Data wrangling
library(tidyr) # Data reshaping
library(reshape2) # Data reshaping (melt/cast)
library(ROSE) # Balancing imbalanced datasets
# Model Building & Machine Learning packages
library(caret) # Model building
library(randomForest) # Random Forest
library(xgboost) # XGBoost
library(e1071) # SVM
library(rpart) # Recursive partitioning and Regression Trees
library(ranger) # Faster implementation of RF
# Model Evaluation and Interpretation packages
library(ModelMetrics) # Performance Metrics
library(pROC) # ROC and AUC curves
library(DALEX) # Model explainability
library(ingredients) # Aligns with DALEX
library(vip) # Visualizing feature importance
# Visualization, Plotting & Reporting packages
library(ggplot2) # Powerful and flexible plotting
library(RColorBrewer) # Predefined color palettes
library(rasterVis) # Customized raster visualization
library(kableExtra) # Table styling
library(viridis) # Colorblind-friendly maps
library(tinytex) # Compile LaTeX docs
To start, load the Sentinel data as well as the land cover classes of the training sites.
# Load Sentinel data
sentinel <- stack("D:\\Study\\Machine Learning\\Geostatistics\\R-Git\\Case-Study_ML-for-Spatial-Data\\Data\\sentinel2017.grd")
# Print the Sentinel data
print(sentinel)
## class : RasterStack
## dimensions : 1257, 1405, 1766085, 10 (nrow, ncol, ncell, nlayers)
## resolution : 10, 10 (x, y)
## extent : 655950, 670000, 5138290, 5150860 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=59 +south +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## names : B02, B03, B04, B08, B05, B06, B07, B8A, NDVI, yellowness
## min values : 619.0000000, 373.0000000, 194.0000000, 98.0000000, 168.3125000, 155.1875000, 139.0625000, 100.1875000, -0.6810878, -0.1066810
## max values : 5385.0000000, 5137.0000000, 5289.0000000, 6984.0000000, 4209.0625000, 5333.6875000, 6705.6875000, 7297.2500000, 0.8989085, 0.5362146
# Load training sites shapefile
training <- read_sf("D:\\Study\\Machine Learning\\Geostatistics\\R-Git\\Case-Study_ML-for-Spatial-Data\\Data\\trainingSites.shp")
# Print the training data
print(training)
## Simple feature collection with 23 features and 2 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 656358 ymin: 5138445 xmax: 669091.9 ymax: 5150851
## Projected CRS: WGS 84 / UTM zone 59S
## # A tibble: 23 × 3
## id Class geometry
## <dbl> <chr> <POLYGON [m]>
## 1 1 Gorse ((660956.4 5146064, 661018.8 5146061, 661033.1 5146008, 6609…
## 2 2 Gorse ((661120.6 5146267, 661197.3 5146263, 661205.7 5146231, 6611…
## 3 3 Gorse ((662279.2 5148528, 662388.2 5148465, 662382.2 5148421, 6622…
## 4 4 Gorse ((663908.7 5148424, 664023.7 5148401, 664054.9 5148404, 6641…
## 5 5 Gorse ((662357.1 5145141, 662514 5145113, 662547.6 5145081, 662636…
## 6 6 Water ((668392.1 5144225, 669091.9 5143123, 668248.4 5143200, 6683…
## 7 7 Water ((665171.5 5138877, 666369.7 5138445, 665401.6 5138522, 6651…
## 8 8 Water ((656358 5147753, 656722.2 5147403, 656444.2 5147446, 656358…
## 9 9 Grassland ((658817.8 5146617, 658875.3 5146725, 658936.4 5146717, 6589…
## 10 10 Grassland ((659407.2 5140035, 659565.4 5139956, 659539 5139786, 659457…
## # ℹ 13 more rows
Overview of the data:
The Sentinel data is a stack of raster layers. The Sentinel subset used
for this tutorial contains the Sentinel channels 2-8 (visible and near
infrared channels) as well as the NDVI and, as an additional band, a
yellowness index calculated as (red+green-blue)/(red+green+blue).
The yellowness index is assumed to be valuable for distinguishing the
striking yellow color of the gorse from other vegetation. To get an idea
about the band composition and spatial resolution, please check the
Sentinel Wikipedia documentation here.
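As a quick sanity check, both indices can be reproduced from raw band values in base R. The band values below are taken from the first row of the merged data shown later (B02 = blue, B03 = green, B04 = red, B08 = NIR); the function names are illustrative, not part of the tutorial code.

```r
# Yellowness index: (red + green - blue) / (red + green + blue)
yellowness_index <- function(red, green, blue) {
  (red + green - blue) / (red + green + blue)
}

# NDVI: (NIR - red) / (NIR + red)
ndvi <- function(nir, red) {
  (nir - red) / (nir + red)
}

# Band values from one gorse pixel: B04 = 1095, B03 = 1113, B02 = 767, B08 = 2854
yellowness_index(1095, 1113, 767) # ~0.484, matching the merged-data table
ndvi(2854, 1095)                  # ~0.445
```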
The training sites are polygons that were digitized in QGIS on the basis
of the Sentinel data using expert knowledge. The training dataset
contains different land cover types, including gorse, grassland, and
other vegetation types.
Visualize the data:
Visualize the Sentinel data as a true color composite in the
geographical context and overlay it with the polygons. Click on the
polygons to see which land cover class is assigned to a respective
polygon.
# Visualize Sentinel data with training sites
viewRGB(sentinel, r=3, g=2, b=1, map.types = "Esri.WorldImagery") +
mapview(training)
Prepare the data for machine learning by extracting raster values for the training sites and merging them with the land cover class information from the training shapefile. This step is crucial for training the machine learning models. Then, split the data into training and test data sets to enable evaluation of the models’ performance.
Extract raster information:
In order to train a machine learning model on the relationship between
the spectral properties and the land cover class, create a data frame
that contains the spectral properties of the training sites as well as
the corresponding class information. This data frame can be produced
with the extract function. The resulting data frame contains the
Sentinel data for each pixel overlaid by the polygons. This data frame
then still needs to be merged with the land cover class information
from the shapefile. This happens via the ID of the polygons, which
is given in the extracted data frame by the column “ID” and in the
shapefile by the attribute “id”.
# Extract raster values for training sites
extr <- raster::extract(sentinel, training, df=TRUE)
# Merge extracted data with training sites
extr <- merge(extr, training, by.x="ID", by.y="id")
kable(head(extr), caption = "Merged Data") %>%
kable_styling(full_width = F, position = "left")
| ID | B02 | B03 | B04 | B08 | B05 | B06 | B07 | B8A | NDVI | yellowness | Class | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 767 | 1113 | 1095 | 2854 | 1425.688 | 2358.625 | 2791.750 | 3091.375 | 0.4454292 | 0.4843698 | Gorse | POLYGON ((660956.4 5146064,… |
| 1 | 767 | 1049 | 1029 | 2667 | 1409.000 | 2320.438 | 2742.688 | 3037.125 | 0.4431818 | 0.4608084 | Gorse | POLYGON ((660956.4 5146064,… |
| 1 | 767 | 956 | 910 | 2444 | 1367.500 | 2213.312 | 2617.562 | 2923.375 | 0.4573644 | 0.4173946 | Gorse | POLYGON ((660956.4 5146064,… |
| 1 | 760 | 944 | 918 | 2418 | 1347.500 | 2184.625 | 2574.375 | 2874.875 | 0.4496403 | 0.4202898 | Gorse | POLYGON ((660956.4 5146064,… |
| 1 | 743 | 1017 | 981 | 2543 | 1349.000 | 2234.375 | 2613.125 | 2891.625 | 0.4432463 | 0.4578621 | Gorse | POLYGON ((660956.4 5146064,… |
| 1 | 737 | 991 | 939 | 2648 | 1352.000 | 2256.250 | 2637.562 | 2922.500 | 0.4764427 | 0.4473191 | Gorse | POLYGON ((660956.4 5146064,… |
Splitting data into training and test data:
In order to keep data for a later (nearly) independent validation, as
well as to limit the number of data points so that model training
won’t take too long, the total data set is split into 30% training
data and 70% test data. A larger test (unseen) set gives a more
reliable estimate of model accuracy. Caret’s createDataPartition
takes care that the class distribution is the same in both data
sets.
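What createDataPartition does can be mimicked in base R by sampling within each class, so the class proportions survive the split. The toy class vector below is illustrative only:

```r
set.seed(100)
# Toy data: 20 Gorse and 80 Water pixels (same idea as the real Class column)
cls <- rep(c("Gorse", "Water"), c(20, 80))

# Stratified 30% sample: draw 30% of the row indices within each class
trainids <- unlist(lapply(split(seq_along(cls), cls),
                          function(i) sample(i, ceiling(0.3 * length(i)))))

# Class proportions are preserved in the training subset
prop.table(table(cls[trainids])) # Gorse 0.2, Water 0.8 — same as the full data
```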
# Set seed for reproducibility
set.seed(100)
# Create training and test dataset
trainids <- createDataPartition(extr$Class, list=FALSE, p=0.3)
trainDat <- extr[trainids,]
testDat <- extr[-trainids,]
# Ensure Class is a factor and is at the same level
trainDat$Class <- factor(trainDat$Class)
testDat$Class <- factor(testDat$Class)
# Check the distribution of classes in training and test datasets
training_class = round(prop.table(table(trainDat$Class)) * 100, 2)
test_class = round(prop.table(table(testDat$Class)) * 100, 2)
kable(training_class, caption = "Training class balance") %>%
kable_styling(full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| Bare | 5.73 |
| Bush | 9.21 |
| Forestry | 9.27 |
| Gorse | 7.96 |
| Grassland | 6.41 |
| Sand | 2.81 |
| Urban | 2.28 |
| Water | 56.33 |
kable(test_class, caption = "Test class balance") %>%
kable_styling(full_width = F, position = "left")
| Var1 | Freq |
|---|---|
| Bare | 5.71 |
| Bush | 9.20 |
| Forestry | 9.27 |
| Gorse | 7.96 |
| Grassland | 6.39 |
| Sand | 2.77 |
| Urban | 2.29 |
| Water | 56.40 |
The class balance tables show a severe class imbalance
within both the training and test data sets. Balancing the
classes by up-sampling is required, since Water has a very high
frequency while the remaining classes are similar to each other and
much rarer than Water.
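Up-sampling, as applied later via caret's sampling = "up" option, replicates minority-class rows (sampling with replacement) until every class matches the majority class count. A minimal base-R sketch of the idea, on a toy class vector:

```r
set.seed(100)
# Toy imbalanced classes: 5 Gorse vs 50 Water pixels
cls <- rep(c("Gorse", "Water"), c(5, 50))

n_max <- max(table(cls)) # size of the majority class (50)

# Resample row indices within each class, with replacement, up to n_max
up_idx <- unlist(lapply(split(seq_along(cls), cls),
                        function(i) sample(i, n_max, replace = TRUE)))

table(cls[up_idx]) # both classes now have 50 rows
```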
Visualize relationships:
To get an idea about the relationships between the spectral Sentinel
data and the land cover class, visualize how the different bands differ
across the land cover classes. This helps us understand the
distribution of the classes in the feature space.
# Visualize relationships (box plot) between yellowness index and land cover class
ggplot(trainDat, aes(x = factor(Class), y = yellowness)) +
geom_boxplot() +
labs(title = "Yellowness Index by Land Cover Class",
x = "Land Cover Class",
y = "Yellowness Index") +
theme_minimal()
# Visualize relationships (box plot) between NDVI and land cover class
ggplot(trainDat, aes(x = factor(Class), y = NDVI)) +
geom_boxplot() +
labs(title = "NDVI by Land Cover Class",
x = "Land Cover Class",
y = "NDVI") +
theme_minimal()
The box plots show the distribution of the yellowness index and NDVI for each land cover class, providing insights into how these indices vary across different classes. The Gorse class features the highest “yellowness” values while all other land cover classes have considerably lower values.
Feature Plot:
Get an impression of the separability of the classes by creating a
feature plot that visualizes the location of the training samples in a
scatter plot of two Sentinel channels.
# Create feature plot to visualize relationships between Sentinel channels (B08 vs yellowness) and land cover class
ggplot(trainDat, aes(x = B08, y = yellowness, color = factor(Class))) +
geom_point() +
labs(title = "Feature Plot: Sentinel Band 8 vs Yellowness",
x = "Sentinel Band 8 (NIR)",
y = "Yellowness Index",
color = "Land Cover Class") +
theme_minimal()
The feature plot shows the distribution of training samples in the feature space defined by Sentinel Band 8 (NIR) and the yellowness index, and helps visualize how well the different land cover classes can be separated based on these features. The Gorse class features the highest “yellowness” values, while all other land cover classes have considerably lower values. More generally, separability is low when only Sentinel channels 3 and 4 (green and red) are considered, but high when channel 8 (near infrared) or the yellowness index is included.
The split data is now ready for training machine learning models using
three different algorithms:
Random Forest, XGBoost, and
Support Vector Machine (SVM). Each model will be trained on
the training data set and evaluated on the test data set.
Selecting predictor and response variable:
Before building the ML models, all the Sentinel bands and indices were
selected as the predictor variables, and the land cover class was
selected as the response variable. The response variable is
categorical, while the predictor variables are continuous.
# Select predictor variables (Sentinel bands and indices)
predictors <- c("B02", "B03", "B04", "B05", "B06", "B07", "B08", "NDVI", "yellowness")
# Select response variable (land cover class)
response <- 'Class'
Random Forest is a popular ensemble learning method that constructs
multiple decision trees and merges their predictions to obtain a more
accurate and stable result. It is particularly effective for
classification tasks and suits this LC classification analysis. The
Random Forest model is trained on the training data set using the
randomForest package in R, via the train function from the
caret package with 10-fold cross-validation for
hyperparameter tuning, and the variable importance is then visualized
to understand which features contribute most to the classification.
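The "merging" step is simply a majority vote over the individual trees' class predictions. A toy base-R illustration (the vote vector is made up for demonstration):

```r
# Hypothetical class predictions from five individual trees for one pixel
tree_votes <- c("Gorse", "Gorse", "Grassland", "Gorse", "Water")

# The forest's prediction is the most frequent class among the trees
forest_prediction <- names(which.max(table(tree_votes)))
forest_prediction # "Gorse"
```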
# Ensure response is a factor
trainDat[, response] <- as.factor(trainDat[, response])
# Note: no tuneGrid is passed to train() below, so caret tunes mtry over
# its default grid (splitrule and min.node.size would apply to
# method = "ranger", not method = "rf")
# Setting train control
control <- trainControl(
method = "cv", # k-fold CV
number = 10, # 10-fold
summaryFunction = multiClassSummary, # for multiclass metrics
classProbs = TRUE,
savePredictions = "final",
sampling = "up"
)
# Train Random Forest model
set.seed(100)
model_rf <- train(trainDat[, predictors], trainDat[, response],
method = "rf",
trControl = control,
importance = TRUE)
# Print model summary
kable(model_rf$results, caption = "Random Forest Model Results") %>%
kable_styling(full_width = F, position = "left")
| mtry | logLoss | AUC | prAUC | Accuracy | Kappa | Mean_F1 | Mean_Sensitivity | Mean_Specificity | Mean_Pos_Pred_Value | Mean_Neg_Pred_Value | Mean_Precision | Mean_Recall | Mean_Detection_Rate | Mean_Balanced_Accuracy | logLossSD | AUCSD | prAUCSD | AccuracySD | KappaSD | Mean_F1SD | Mean_SensitivitySD | Mean_SpecificitySD | Mean_Pos_Pred_ValueSD | Mean_Neg_Pred_ValueSD | Mean_PrecisionSD | Mean_RecallSD | Mean_Detection_RateSD | Mean_Balanced_AccuracySD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.0194079 | 0.9996537 | 0.3135054 | 0.9944470 | 0.9914640 | 0.9731647 | 0.9717758 | 0.9992815 | 0.9772389 | 0.9992905 | 0.9772389 | 0.9717758 | 0.1243059 | 0.9855286 | 0.0078280 | 0.0003859 | 0.0337306 | 0.0039935 | 0.0061337 | 0.0194530 | 0.0201904 | 0.0005168 | 0.0180031 | 0.0005101 | 0.0180031 | 0.0201904 | 0.0004992 | 0.0103524 |
| 5 | 0.0192705 | 0.9996919 | 0.2236961 | 0.9932740 | 0.9896542 | 0.9682776 | 0.9671660 | 0.9991251 | 0.9747691 | 0.9991392 | 0.9747691 | 0.9671660 | 0.1241592 | 0.9831455 | 0.0135754 | 0.0004931 | 0.0369808 | 0.0051617 | 0.0079476 | 0.0263027 | 0.0261678 | 0.0006728 | 0.0214679 | 0.0006581 | 0.0214679 | 0.0261678 | 0.0006452 | 0.0134152 |
| 9 | 0.0582227 | 0.9964809 | 0.1037647 | 0.9941511 | 0.9910037 | 0.9739457 | 0.9729646 | 0.9992407 | 0.9772993 | 0.9992473 | 0.9772993 | 0.9729646 | 0.1242689 | 0.9861027 | 0.0727093 | 0.0055828 | 0.0303429 | 0.0043519 | 0.0066949 | 0.0173850 | 0.0177504 | 0.0005681 | 0.0163332 | 0.0005644 | 0.0163332 | 0.0177504 | 0.0005440 | 0.0091501 |
# Extract variable importance
importance_df <- varImp(model_rf)$importance
# Add variable names as a column
importance_df$Variable <- rownames(importance_df)
# Compute overall importance (if multiple classes)
importance_df$Overall <- rowMeans(importance_df[, sapply(importance_df, is.numeric)])
# Round and sort in descending order
importance_df <- importance_df %>%
mutate(Overall = round(Overall, 3)) %>%
arrange(desc(Overall)) # High to low
# View top variables
kable(importance_df, caption = "Variable Importance for Random Forest Model") %>%
kable_styling(full_width = F, position = "left")
|  | Bare | Bush | Forestry | Gorse | Grassland | Sand | Urban | Water | Variable | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| NDVI | 78.81922 | 51.74690 | 42.84082 | 67.26438 | 54.32107 | 82.43570 | 96.46106 | 27.60330 | NDVI | 62.687 |
| yellowness | 80.57554 | 44.51956 | 41.20206 | 75.49890 | 40.48474 | 53.72358 | 82.26013 | 20.75973 | yellowness | 54.878 |
| B05 | 57.84457 | 41.90438 | 38.77014 | 43.10041 | 40.90441 | 99.16067 | 79.66702 | 33.45632 | B05 | 54.351 |
| B07 | 70.04270 | 42.26636 | 33.83277 | 42.12963 | 35.77620 | 64.99589 | 84.73350 | 29.86652 | B07 | 50.455 |
| B06 | 65.81163 | 41.86432 | 32.35504 | 40.79115 | 30.54126 | 64.34196 | 84.33848 | 34.24737 | B06 | 49.286 |
| B04 | 59.79071 | 51.15562 | 31.61819 | 54.23198 | 30.60026 | 64.23863 | 77.03137 | 21.19083 | B04 | 48.732 |
| B02 | 93.18263 | 19.81698 | 15.63798 | 57.58772 | 16.88825 | 67.45459 | 100.00000 | 14.20616 | B02 | 48.097 |
| B08 | 54.83551 | 37.96788 | 31.79415 | 32.47370 | 29.06082 | 67.23859 | 92.70732 | 31.65667 | B08 | 47.217 |
| B03 | 70.19851 | 36.28995 | 41.09424 | 45.42369 | 33.77928 | 65.11085 | 72.26431 | 0.00000 | B03 | 45.520 |
# Plot variable importance
ggplot(importance_df, aes(x = reorder(Variable, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Variable Importance for RF model (High to Low)",
x = "Variable", y = "Importance Score") +
theme_minimal()
The Random Forest model is successfully trained, and the variable
importance is noted to determine which predictor variables
(Sentinel bands) contribute most to the classification. The
model achieved excellent performance across multiple values of
mtry, with the lowest logLoss at mtry = 5. At
mtry = 5, the model shows:
- Lowest uncertainty (logLoss = 0.0192),
- Nearly perfect class separation (AUC = 0.999),
- High classification accuracy (accuracy = 0.9933),
- Good overall class performance (balanced accuracy = 0.9831), and
- Balanced precision and recall (mean F1 score = 0.968).
The variable importance scores for each predictor variable are
derived. The variable importance plot shows which features contribute
most to the classification task, with NDVI,
yellowness & B05 being the top 3
important predictors.
Predicting the test data with confusion matrix and scoring
metrics:
To evaluate the performance of the Random Forest model on the test data
set, a confusion matrix and scoring metrics such as accuracy,
sensitivity, and specificity were calculated. The confusion matrix
provides insight into how well the model performs on the test data set.
# Predict on the test dataset
predictions_rf <- predict(model_rf,
newdata = testDat[, predictors])
# Extract the levels from the training data response
true_levels <- levels(trainDat[[response]])
# Convert both predicted and actual responses to factors with the same levels
predictions_rf <- factor(predictions_rf, levels = true_levels)
actual_rf <- factor(testDat[[response]], levels = true_levels)
# Construct the confusion matrix
confusion_rf <- caret::confusionMatrix(predictions_rf, actual_rf)
# Plot confusion matrix
cm_plot_rf <- ggplot(as.data.frame(confusion_rf$table),
aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = Freq), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
geom_text(aes(label = Freq), color = "black") +
labs(title = "Confusion Matrix for Random Forest Model",
x = "Reference Class",
y = "Predicted Class") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
cm_plot_rf
# Print scoring metrics
kable(confusion_rf$overall, caption = "Scoring Metrics for Random Forest Model") %>%
kable_styling(full_width = F, position = "left")
|  | x |
|---|---|
| Accuracy | 0.9949774 |
| Kappa | 0.9922706 |
| AccuracyLower | 0.9931669 |
| AccuracyUpper | 0.9964094 |
| AccuracyNull | 0.5640382 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | NaN |
# Print sensitivity and specificity for each class
kable(confusion_rf$byClass, caption = "Sensitivity and Specificity for Random Forest Model") %>%
kable_styling(full_width = F, position = "left")
|  | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Class: Bare | 0.9868132 | 0.9985351 | 0.9760870 | 0.9992004 | 0.9760870 | 0.9868132 | 0.9814208 | 0.0571321 | 0.0563787 | 0.0577599 | 0.9926741 |
| Class: Bush | 1.0000000 | 0.9998617 | 0.9986376 | 1.0000000 | 0.9986376 | 1.0000000 | 0.9993183 | 0.0920392 | 0.0920392 | 0.0921647 | 0.9999309 |
| Class: Forestry | 0.9986450 | 1.0000000 | 1.0000000 | 0.9998616 | 1.0000000 | 0.9986450 | 0.9993220 | 0.0926670 | 0.0925414 | 0.0925414 | 0.9993225 |
| Class: Gorse | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 0.0796082 | 0.0796082 | 0.0796082 | 1.0000000 |
| Class: Grassland | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 0.0639126 | 0.0639126 | 0.0639126 | 1.0000000 |
| Class: Sand | 0.9457014 | 0.9987085 | 0.9543379 | 0.9984506 | 0.9543379 | 0.9457014 | 0.9500000 | 0.0277499 | 0.0262431 | 0.0274987 | 0.9722049 |
| Class: Urban | 0.8846154 | 0.9976870 | 0.8994413 | 0.9973025 | 0.8994413 | 0.8846154 | 0.8919668 | 0.0228528 | 0.0202160 | 0.0224761 | 0.9411512 |
| Class: Water | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 0.5640382 | 0.5640382 | 0.5640382 | 1.0000000 |
Results:
The confusion matrix and scoring metrics provide insights into the
performance of the Random Forest model on the test data set. The
confusion matrix shows how many instances were correctly classified for
each land cover class, while the scoring metrics (accuracy, sensitivity,
specificity) quantify the model’s overall performance.
On the test (unseen) data, the confusion matrix shows high true positives for all classes, with an overall accuracy of 99.5%, but there are some irregularities in the classification. From the confusion matrix:
- Water, Grassland, Gorse & Bush are 100% predicted and classified correctly.
- Out of 738 Forestry pixels, 1 was predicted as Bush.
- Out of 182 Urban pixels, 10 were predicted as Sand and 11 as Bare.
- Out of 221 Sand pixels, 12 were predicted as Urban.
- Out of 455 Bare pixels, 6 were predicted as Urban.
This means that only the Urban, Bare, and
Sand classes show a small number of misclassifications under
the RF model.
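The per-class sensitivities reported in the by-class table follow directly from these counts. A base-R check, using the Urban and Bare counts quoted above (the numbers are reconstructed from the text, not taken from the model object):

```r
# Urban: 182 reference pixels, of which 10 went to Sand and 11 to Bare
urban_total   <- 182
urban_correct <- urban_total - 10 - 11 # 161 correct
urban_sensitivity <- urban_correct / urban_total
round(urban_sensitivity, 4) # 0.8846 — matches the by-class table

# Bare: 455 reference pixels, of which 6 went to Urban
bare_sensitivity <- (455 - 6) / 455
round(bare_sensitivity, 4)  # 0.9868
```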
XGBoost (Extreme Gradient Boosting) is a powerful and efficient
implementation of gradient boosting that is widely used for
classification tasks. It is known for its speed and performance, making
it suitable for large data sets. The XGBoost model is trained on the
training data set using the xgboost package in R, via the
train function from the caret package with
10-fold cross-validation and a fixed hyperparameter grid, and the
variable importance is then visualized to understand which features
contribute most to the classification.
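The core idea of boosting — each round adds a correction fitted to the current residuals, scaled by the learning rate eta — can be sketched in a few lines of base R. This is a highly simplified toy (a regression target and an idealized learner that fits the residuals exactly), not XGBoost itself; the error then shrinks by a factor of (1 - eta) per round:

```r
y    <- c(1, 2, 3, 4)   # toy targets
pred <- rep(mean(y), 4) # round 0: constant initial model
eta  <- 0.3             # learning rate, same eta as in the grid below

for (round in 1:100) {            # nrounds = 100, as in the grid
  residuals <- y - pred           # what the next learner must explain
  pred <- pred + eta * residuals  # idealized learner fits residuals exactly
}

max(abs(pred - y)) # ~0: after 100 rounds the ensemble has converged
```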
# Ensure response is a factor
trainDat[, response] <- as.factor(trainDat[, response])
# Define hyperparameter tuning grid
xgb_grid <- expand.grid(
nrounds = 100,
max_depth = 6,
eta = 0.3,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
# Setting train control
control <- trainControl(
method = "cv", # k-fold CV
number = 10, # 10-fold
summaryFunction = multiClassSummary, # for multiclass metrics
classProbs = TRUE,
savePredictions = "final",
sampling = "up",
verboseIter = FALSE
)
# Train the XGBoost model using caret
set.seed(100)
model_xgb <- train(x = trainDat[, predictors],
y = trainDat[, response],
method = "xgbTree",
trControl = control,
tuneGrid = xgb_grid)
# Print model summary
kable(model_xgb$results, caption = "XGBoost Model Results") %>%
kable_styling(full_width = F, position = "left")
| nrounds | max_depth | eta | gamma | colsample_bytree | min_child_weight | subsample | logLoss | AUC | prAUC | Accuracy | Kappa | Mean_F1 | Mean_Sensitivity | Mean_Specificity | Mean_Pos_Pred_Value | Mean_Neg_Pred_Value | Mean_Precision | Mean_Recall | Mean_Detection_Rate | Mean_Balanced_Accuracy | logLossSD | AUCSD | prAUCSD | AccuracySD | KappaSD | Mean_F1SD | Mean_SensitivitySD | Mean_SpecificitySD | Mean_Pos_Pred_ValueSD | Mean_Neg_Pred_ValueSD | Mean_PrecisionSD | Mean_RecallSD | Mean_Detection_RateSD | Mean_Balanced_AccuracySD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 6 | 0.3 | 0 | 1 | 1 | 1 | 0.0283113 | 0.9997245 | 0.8622213 | 0.9929867 | 0.9892228 | 0.9708748 | 0.9697435 | 0.9990861 | 0.9750296 | 0.9990923 | 0.9750296 | 0.9697435 | 0.1241233 | 0.9844148 | 0.0222191 | 0.000436 | 0.0349824 | 0.0047978 | 0.0073673 | 0.0198403 | 0.0213631 | 0.0006256 | 0.015926 | 0.000619 | 0.015926 | 0.0213631 | 0.0005997 | 0.0109808 |
# Extract variable importance
importance_xgb <- xgb.importance(feature_names = predictors,
model = model_xgb$finalModel)
# Convert importance to a data frame
importance_df_xgb <- as.data.frame(importance_xgb)
importance_df_xgb <- importance_df_xgb %>%
mutate(Variable = reorder(Feature, Gain)) %>%
arrange(desc(Gain))
# Check the summary of the importance data frame
kable(importance_df_xgb, caption = "Variable Importance for XGBoost Model") %>%
kable_styling(full_width = F, position = "left")
| Feature | Gain | Cover | Frequency | Variable |
|---|---|---|---|---|
| B05 | 0.2319615 | 0.1726225 | 0.1417647 | B05 |
| yellowness | 0.1889079 | 0.1664336 | 0.1629412 | yellowness |
| B02 | 0.1794137 | 0.1838177 | 0.1776471 | B02 |
| B03 | 0.1585555 | 0.1987152 | 0.1423529 | B03 |
| B07 | 0.1433434 | 0.1102423 | 0.1052941 | B07 |
| B04 | 0.0420511 | 0.1087894 | 0.0747059 | B04 |
| NDVI | 0.0339618 | 0.0339363 | 0.0970588 | NDVI |
| B06 | 0.0186276 | 0.0191703 | 0.0582353 | B06 |
| B08 | 0.0031775 | 0.0062727 | 0.0400000 | B08 |
# Visualize variable importance using ggplot2
ggplot(importance_df_xgb, aes(x = Variable, y = Gain)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Variable Importance for XGBoost (High to Low)",
x = "Variable", y = "Importance Score") +
theme_minimal()
The XGBoost model is successfully trained, and the variable importance
is noted to determine which predictor variables (Sentinel bands)
contribute most to the classification. The variable importance
scores for each predictor variable are derived. The variable importance
plot shows which features contribute most to the classification task,
with B05, yellowness &
B02 being the top 3 important predictors for the
XGBoost model.
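XGBoost's Gain scores are normalized shares of the total loss reduction, so they sum to 1 and can be read as proportions. A quick base-R check on the values from the importance table above:

```r
# Gain values copied from the XGBoost variable importance table
gain <- c(B05 = 0.2319615, yellowness = 0.1889079, B02 = 0.1794137,
          B03 = 0.1585555, B07 = 0.1433434, B04 = 0.0420511,
          NDVI = 0.0339618, B06 = 0.0186276, B08 = 0.0031775)

sum(gain)                     # 1: Gain is a share of the total loss reduction
round(100 * gain[["B05"]], 1) # 23.2 — B05 accounts for ~23% of the gain
```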
The XGBoost model has been trained successfully with the selected hyperparameters. The model achieved excellent performance, with:
- Low uncertainty (logLoss = 0.0283),
- Nearly perfect class separation (AUC = 0.999),
- Near-perfect agreement beyond chance (Kappa = 0.989),
- High classification accuracy (Accuracy = 0.993),
- Good overall class performance (balanced accuracy = 0.984), and
- Balanced precision and recall (mean F1 score = 0.97).
These scores are very similar to those of the Random Forest model and will be compared in the results and discussion section.
Predicting the test data with confusion matrix and scoring
metrics:
To evaluate the performance of the XGBoost model on the test data set,
a confusion matrix was constructed and scoring metrics such as
accuracy, sensitivity, and specificity were calculated. The confusion matrix provides
insights into how well the model performs on the test data set.
# Predict on the test dataset
predictions_xgb <- predict(model_xgb, newdata = testDat[, predictors])
# Extract the levels from the training data response
true_levels_xgb <- levels(trainDat[[response]])
# Convert both predicted and actual responses to factors with the same levels
predictions_xgb <- factor(predictions_xgb, levels = true_levels_xgb)
actual_xgb <- factor(testDat[[response]], levels = true_levels_xgb)
# Now create the confusion matrix
confusion_xgb <- caret::confusionMatrix(predictions_xgb, actual_xgb)
# Plot confusion matrix
cm_plot_xgb <- ggplot(as.data.frame(confusion_xgb$table), aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = Freq), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
geom_text(aes(label = Freq), color = "black") +
labs(title = "Confusion Matrix for XGBoost Model",
x = "Reference Class",
y = "Predicted Class") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
cm_plot_xgb
# Print scoring metrics
kable(confusion_xgb$overall, caption = "Scoring Metrics for XGBoost Model") %>%
kable_styling(full_width = F, position = "left")
| x | |
|---|---|
| Accuracy | 0.9927172 |
| Kappa | 0.9887915 |
| AccuracyLower | 0.9905954 |
| AccuracyUpper | 0.9944654 |
| AccuracyNull | 0.5640382 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | NaN |
# Print sensitivity and specificity for each class
kable(confusion_xgb$byClass, caption = "Sensitivity and Specificity for XGBoost Model") %>%
kable_styling(full_width = F, position = "left")
| Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Class: Bare | 0.9824176 | 0.9985351 | 0.9759825 | 0.9989342 | 0.9759825 | 0.9824176 | 0.9791895 | 0.0571321 | 0.0561276 | 0.0575088 | 0.9904763 |
| Class: Bush | 0.9986357 | 0.9995851 | 0.9959184 | 0.9998617 | 0.9959184 | 0.9986357 | 0.9972752 | 0.0920392 | 0.0919136 | 0.0922903 | 0.9991104 |
| Class: Forestry | 0.9972900 | 1.0000000 | 1.0000000 | 0.9997233 | 1.0000000 | 0.9972900 | 0.9986431 | 0.0926670 | 0.0924159 | 0.0924159 | 0.9986450 |
| Class: Gorse | 0.9952681 | 1.0000000 | 1.0000000 | 0.9995909 | 1.0000000 | 0.9952681 | 0.9976285 | 0.0796082 | 0.0792315 | 0.0792315 | 0.9976341 |
| Class: Grassland | 1.0000000 | 0.9997317 | 0.9960861 | 1.0000000 | 0.9960861 | 1.0000000 | 0.9980392 | 0.0639126 | 0.0639126 | 0.0641637 | 0.9998659 |
| Class: Sand | 0.9185520 | 0.9980628 | 0.9311927 | 0.9976762 | 0.9311927 | 0.9185520 | 0.9248292 | 0.0277499 | 0.0254897 | 0.0273732 | 0.9583074 |
| Class: Urban | 0.8571429 | 0.9966590 | 0.8571429 | 0.9966590 | 0.8571429 | 0.8571429 | 0.8571429 | 0.0228528 | 0.0195881 | 0.0228528 | 0.9269009 |
| Class: Water | 1.0000000 | 0.9997120 | 0.9997774 | 1.0000000 | 0.9997774 | 1.0000000 | 0.9998887 | 0.5640382 | 0.5640382 | 0.5641637 | 0.9998560 |
Results
Confusion Matrix and scoring metrics provide insights into the
performance of the XGBoost model on the test dataset. The confusion
matrix shows how many instances were correctly classified for each land
cover class, while the scoring metrics (accuracy, sensitivity,
specificity) quantify the model’s overall performance.
Predictions on the unseen test data for the XGBoost model show high true positives for all classes, with an overall accuracy of 99.27%, but there are some irregularities in the classification. From the confusion matrix,
Water and Grassland are predicted and
classified 100% correctly, and Forestry is
nearly perfect.
Out of 182 Urban instances, 15 were predicted as
Sand and 11 as Bare
Out of 221 Sand instances, 18 were predicted as
Urban
Out of 634 Gorse instances, 2 were predicted as
Grassland and 1 as Urban
Out of 733 Bush instances, 1 was predicted as
Water
Out of 455 Bare instances, 7 were predicted as
Urban and 1 as Bush.
This means that the Urban, Sand,
Gorse, Bush and Bare classes show a
small number of misclassifications with the XGBoost model.
Support Vector Machine (SVM) is a supervised learning algorithm that
can be used for classification tasks. It works by finding the hyperplane
that best separates the classes in the feature space. SVM is effective
in high-dimensional spaces and is suitable for this land cover classification
analysis. The SVM model is trained on the training data set (via the
kernlab backend), and the variable importance is visualized to
understand which features contribute most to the
classification. The train function from the caret
package was used to train the model with 10-fold cross-validation for
hyperparameter tuning.
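The sigma tuned below is the width parameter of the radial-basis kernel used by svmRadial (kernlab's parameterization, assumed here): similarity between two pixels decays with their squared distance in feature space. A minimal sketch:

```r
# Radial-basis (Gaussian) kernel: k(x, y) = exp(-sigma * ||x - y||^2)
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

x1 <- c(0, 0)
x2 <- c(1, 0)
rbf_kernel(x1, x2, sigma = 0.5)  # exp(-0.5), moderately similar
rbf_kernel(x1, x2, sigma = 2)    # exp(-2), similarity decays much faster
```

Larger sigma makes the decision boundary more local (each support vector influences a smaller neighbourhood), which is why it is tuned jointly with the cost C.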
# Ensure the response is a factor
trainDat[, response] <- as.factor(trainDat[, response])
# Define hyperparameter tuning grid
svm_grid <- expand.grid(
C = 2^(-1:1), # Cost parameter
sigma = 2^(-1:1)) # Kernel parameter
# Setting train control
control <- trainControl(
method = "cv", # k-fold CV
number = 10, # 10-fold
summaryFunction = multiClassSummary,
classProbs = TRUE,
savePredictions = "final",
sampling = "up")
# Train Support Vector Machine (SVM) model
set.seed(100)
model_svm <- train(x = trainDat[, predictors],
y = trainDat[, response],
method = "svmRadial",
trControl = control,
tuneGrid = svm_grid)
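The sampling = "up" option in the train control asks caret to up-sample minority classes within each resampling fold, so that rare classes (such as Urban or Sand) are not swamped by Water pixels. The idea can be illustrated in base R on a toy factor (hypothetical data, not the project's):

```r
set.seed(1)
# Imbalanced toy response: 20 Gorse vs 80 Grassland
y <- factor(rep(c("Gorse", "Grassland"), times = c(20, 80)))
n_max <- max(table(y))
# Resample each minority class with replacement up to the majority size
balanced_idx <- unlist(lapply(levels(y), function(cl) {
  idx <- which(y == cl)
  if (length(idx) < n_max)
    idx <- c(idx, sample(idx, n_max - length(idx), replace = TRUE))
  idx
}))
table(y[balanced_idx])  # both classes now have 80 rows
```

caret applies this inside each cross-validation fold rather than once up front, which avoids leaking duplicated rows between training and held-out folds.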
# Print model summary
kable(model_svm$results, caption = "Support Vector Machine Model Results") %>%
kable_styling(full_width = F, position = "left")
| C | sigma | logLoss | AUC | prAUC | Accuracy | Kappa | Mean_F1 | Mean_Sensitivity | Mean_Specificity | Mean_Pos_Pred_Value | Mean_Neg_Pred_Value | Mean_Precision | Mean_Recall | Mean_Detection_Rate | Mean_Balanced_Accuracy | logLossSD | AUCSD | prAUCSD | AccuracySD | KappaSD | Mean_F1SD | Mean_SensitivitySD | Mean_SpecificitySD | Mean_Pos_Pred_ValueSD | Mean_Neg_Pred_ValueSD | Mean_PrecisionSD | Mean_RecallSD | Mean_Detection_RateSD | Mean_Balanced_AccuracySD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 0.5 | 1.303668 | 0.9287637 | 0.7638325 | 0.8514147 | 0.7737012 | NaN | 0.7151947 | 0.9808636 | NaN | 0.9813459 | NaN | 0.7151947 | 0.1064268 | 0.8480292 | 0.0169874 | 0.0167498 | 0.0204171 | 0.0059123 | 0.0091274 | NA | 0.0241865 | 0.0007683 | NA | 0.0007509 | NA | 0.0241865 | 0.0007390 | 0.0124584 |
| 4 | 1.0 | 0.5 | 1.294641 | 0.9382290 | 0.7742504 | 0.8531708 | 0.7763659 | NaN | 0.7187930 | 0.9810950 | NaN | 0.9815664 | NaN | 0.7187930 | 0.1066463 | 0.8499440 | 0.0201787 | 0.0117038 | 0.0155778 | 0.0057122 | 0.0087082 | NA | 0.0199254 | 0.0007519 | NA | 0.0007210 | NA | 0.0199254 | 0.0007140 | 0.0103050 |
| 7 | 2.0 | 0.5 | 1.284566 | 0.9387793 | 0.7762996 | 0.8546319 | 0.7785823 | NaN | 0.7244229 | 0.9812817 | NaN | 0.9817460 | NaN | 0.7244229 | 0.1068290 | 0.8528523 | 0.0197518 | 0.0127177 | 0.0158253 | 0.0077519 | 0.0117688 | NA | 0.0271449 | 0.0010165 | NA | 0.0009699 | NA | 0.0271449 | 0.0009690 | 0.0140707 |
| 2 | 0.5 | 1.0 | 1.267989 | 0.9068621 | 0.7192716 | 0.8508333 | 0.7728938 | NaN | 0.7154631 | 0.9808143 | NaN | 0.9812850 | NaN | 0.7154631 | 0.1063542 | 0.8481387 | 0.0152577 | 0.0280850 | 0.0423211 | 0.0058313 | 0.0089292 | NA | 0.0248521 | 0.0007649 | NA | 0.0007549 | NA | 0.0248521 | 0.0007289 | 0.0127806 |
| 5 | 1.0 | 1.0 | 1.258247 | 0.9096912 | 0.7258875 | 0.8531682 | 0.7764219 | NaN | 0.7224372 | 0.9811123 | NaN | 0.9815673 | NaN | 0.7224372 | 0.1066460 | 0.8517748 | 0.0226213 | 0.0312114 | 0.0474619 | 0.0050834 | 0.0077883 | NA | 0.0164739 | 0.0006658 | NA | 0.0006413 | NA | 0.0164739 | 0.0006354 | 0.0085204 |
| 8 | 2.0 | 1.0 | 1.248940 | 0.9048456 | 0.7143066 | 0.8537582 | 0.7773079 | NaN | 0.7196695 | 0.9811907 | NaN | 0.9816130 | NaN | 0.7196695 | 0.1067198 | 0.8504301 | 0.0256298 | 0.0313795 | 0.0414158 | 0.0088198 | 0.0131808 | NA | 0.0257316 | 0.0011465 | NA | 0.0010635 | NA | 0.0257316 | 0.0011025 | 0.0133953 |
| 3 | 0.5 | 2.0 | 1.253344 | 0.8683051 | 0.6853715 | 0.8479136 | 0.7684864 | NaN | 0.7074507 | 0.9804388 | NaN | 0.9809286 | NaN | 0.7074507 | 0.1059892 | 0.8439448 | 0.0203102 | 0.0104051 | 0.0206338 | 0.0043704 | 0.0065018 | NA | 0.0223733 | 0.0005722 | NA | 0.0005594 | NA | 0.0223733 | 0.0005463 | 0.0114568 |
| 6 | 1.0 | 2.0 | 1.246579 | 0.8706399 | 0.6880568 | 0.8499578 | 0.7715744 | NaN | 0.7133815 | 0.9807017 | NaN | 0.9811700 | NaN | 0.7133815 | 0.1062447 | 0.8470416 | 0.0191771 | 0.0124340 | 0.0201726 | 0.0050968 | 0.0076873 | NA | 0.0214185 | 0.0006621 | NA | 0.0006443 | NA | 0.0214185 | 0.0006371 | 0.0110071 |
| 9 | 2.0 | 2.0 | 1.242011 | 0.8712301 | 0.6902114 | 0.8505426 | 0.7724607 | NaN | 0.7155690 | 0.9807767 | NaN | 0.9812454 | NaN | 0.7155690 | 0.1063178 | 0.8481728 | 0.0138274 | 0.0117582 | 0.0214822 | 0.0047950 | 0.0072443 | NA | 0.0190069 | 0.0006270 | NA | 0.0006159 | NA | 0.0190069 | 0.0005994 | 0.0097863 |
# Extract variable importance
importance_svm <- varImp(model_svm)$importance
# Add variable names as a column
importance_svm$Variable <- rownames(importance_svm)
# Compute overall importance (if multiple classes)
importance_svm$Overall <- rowMeans(importance_svm[, sapply(importance_svm, is.numeric)])
# Round and sort in descending order
importance_svm <- importance_svm %>%
mutate(Overall = round(Overall, 3)) %>%
arrange(desc(Overall)) # High to low
# Print top variables
kable(importance_svm, caption = "Variable Importance for Support Vector Machine Model") %>%
kable_styling(full_width = F, position = "left")
| Bare | Bush | Forestry | Gorse | Grassland | Sand | Urban | Water | Variable | Overall | |
|---|---|---|---|---|---|---|---|---|---|---|
| B04 | 100.00000 | 100.00000 | 100.00000 | 100.00000 | 100.00000 | 100.00000 | 100.00000 | 100.00000 | B04 | 100.000 |
| NDVI | 100.00000 | 100.00000 | 100.00000 | 100.00000 | 100.00000 | 100.00000 | 100.00000 | 100.00000 | NDVI | 100.000 |
| yellowness | 100.00000 | 99.91037 | 99.91037 | 99.91037 | 99.91037 | 100.00000 | 100.00000 | 100.00000 | yellowness | 99.955 |
| B02 | 100.00000 | 99.61459 | 99.61459 | 100.00000 | 99.61459 | 99.61459 | 99.61459 | 100.00000 | B02 | 99.759 |
| B03 | 100.00000 | 98.08192 | 98.08192 | 99.26476 | 98.08192 | 98.08192 | 100.00000 | 100.00000 | B03 | 98.949 |
| B05 | 100.00000 | 80.18284 | 80.18284 | 80.18284 | 80.18284 | 100.00000 | 99.98892 | 100.00000 | B05 | 90.090 |
| B07 | 60.91243 | 99.97924 | 100.00000 | 60.91243 | 60.91243 | 100.00000 | 60.91243 | 99.97924 | B07 | 80.451 |
| B08 | 25.39213 | 99.95848 | 100.00000 | 25.39213 | 25.39213 | 100.00000 | 25.39213 | 99.95848 | B08 | 62.686 |
| B06 | 0.00000 | 99.23189 | 100.00000 | 0.00000 | 0.00000 | 100.00000 | 0.00000 | 99.23189 | B06 | 49.808 |
# Plot variable importance
ggplot(importance_svm, aes(x = reorder(Variable, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Variable Importance for SVM model (High to Low)",
x = "Variable", y = "Importance Score") +
theme_minimal()
The SVM model is trained; the summary results and variable importance are
recorded to determine how much each predictor variable (Sentinel bands
and indices) contributes to this classification. The variable importance
scores for each predictor variable are derived. The variable importance
plot shows which features contribute most to the classification task,
with NDVI, B04 &
yellowness being the top 3 predictors for
the SVM model.
Predicting the test data with confusion matrix and scoring
metrics:
To evaluate the performance of the SVM model on the test data set, a
confusion matrix and scoring metrics such as accuracy, sensitivity, and
specificity were calculated. The confusion matrix provides insights into
how well the model performs on the test data set.
# Predict on the test dataset
predictions_svm <- predict(model_svm,
newdata = testDat[, predictors])
probs_svm <- predict(model_svm,
newdata = testDat[, predictors],
type = "prob")
# Extract the levels from the training data response
true_levels_svm <- levels(trainDat[[response]])
# Convert both predicted and actual responses to factors with the same levels
predictions_svm <- factor(predictions_svm,
levels = true_levels_svm)
actual_svm <- factor(testDat[[response]],
levels = true_levels_svm)
# Construct the confusion matrix
confusion_svm <- caret::confusionMatrix(predictions_svm, actual_svm)
# Plot confusion matrix
cm_plot_svm <- ggplot(as.data.frame(confusion_svm$table), aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = Freq), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
geom_text(aes(label = Freq), color = "black") +
labs(title = "Confusion Matrix for Support Vector Machine Model",
x = "Reference Class",
y = "Predicted Class") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot Confusion Matrix
cm_plot_svm
# Print scoring metrics
kable(confusion_svm$overall, caption = "Scoring Metrics for Support Vector Machine Model") %>%
kable_styling(full_width = F, position = "left")
| x | |
|---|---|
| Accuracy | 0.8528378 |
| Kappa | 0.7756107 |
| AccuracyLower | 0.8448654 |
| AccuracyUpper | 0.8605510 |
| AccuracyNull | 0.5640382 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | NaN |
# Print sensitivity and specificity for each class
kable(confusion_svm$byClass, caption = "Sensitivity and Specificity for Support Vector Machine Model") %>%
kable_styling(full_width = F, position = "left")
| Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Class: Bare | 1.0000000 | 0.9957384 | 0.9342916 | 1.0000000 | 0.9342916 | 1.0000000 | 0.9660297 | 0.0571321 | 0.0571321 | 0.0611502 | 0.9978692 |
| Class: Bush | 1.0000000 | 0.9994468 | 0.9945726 | 1.0000000 | 0.9945726 | 1.0000000 | 0.9972789 | 0.0920392 | 0.0920392 | 0.0925414 | 0.9997234 |
| Class: Forestry | 0.9945799 | 1.0000000 | 1.0000000 | 0.9994467 | 1.0000000 | 0.9945799 | 0.9972826 | 0.0926670 | 0.0921647 | 0.0921647 | 0.9972900 |
| Class: Gorse | 0.0662461 | 0.9998636 | 0.9767442 | 0.9252620 | 0.9767442 | 0.0662461 | 0.1240768 | 0.0796082 | 0.0052737 | 0.0053993 | 0.5330548 |
| Class: Grassland | 0.0000000 | 1.0000000 | NaN | 0.9360874 | NA | 0.0000000 | NA | 0.0639126 | 0.0000000 | 0.0000000 | 0.5000000 |
| Class: Sand | 0.9683258 | 0.8545783 | 0.1597015 | 0.9989432 | 0.1597015 | 0.9683258 | 0.2741832 | 0.0277499 | 0.0268709 | 0.1682572 | 0.9114521 |
| Class: Urban | 0.6703297 | 0.9988435 | 0.9312977 | 0.9923401 | 0.9312977 | 0.6703297 | 0.7795527 | 0.0228528 | 0.0153189 | 0.0164490 | 0.8345866 |
| Class: Water | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 0.5640382 | 0.5640382 | 0.5640382 | 1.0000000 |
Results
The confusion matrix and scoring metrics provide insights into the
performance of the Support Vector Machine (SVM) model on the test data
set. The confusion matrix shows that many instances were
misclassified for several land cover classes, and the scoring metrics
(accuracy, sensitivity, specificity) confirm the model's poor
performance.
The SVM model did not train successfully for this task: it achieved poor performance, with low values across the scoring metrics. The confusion matrix shows heavy misclassification of the defined land cover classes; in particular, Gorse sensitivity is only 0.066 and Grassland is never predicted at all. Hence the model does not fit this land cover classification task and will not be used further in the model comparison and in the results and discussion of this project.
Now the predicted model maps are visualized to check the land cover classification, and the training classes are overlaid on the predicted maps to verify the classification visually.
# Stacking Sentinel bands + indices
predictors_stack <- stack(sentinel$B02,
sentinel$B03,
sentinel$B04,
sentinel$B05,
sentinel$B06,
sentinel$B07,
sentinel$B08,
sentinel$NDVI,
sentinel$yellowness)
# Convert raster stack to dataframe (pixel wise)
pixels_df <- as.data.frame(predictors_stack, na.rm = TRUE)
# Predict raster cells for each model
predictions_rf <- predict(model_rf, newdata = pixels_df, type = "raw")
predictions_xgb <- predict(model_xgb, newdata = pixels_df, type = "raw")
# Create raster layers for each model prediction
raster_rf <- setValues(raster(predictors_stack), predictions_rf)
raster_xgb <- setValues(raster(predictors_stack), predictions_xgb)
# Define class names and palette
classes <- c("Bare", "Bush", "Forestry", "Gorse", "Grassland", "Sand", "Urban", "Water")
class_colors <- c(
"Bare" = "#a6643a", # brown
"Bush" = "#d01c8b", # magenta
"Forestry" = "#1b7837", # dark green
"Gorse" = "#ffff00", # yellow
"Grassland" = "#4dac26", # light green
"Sand" = "#f6e8c3", # sand yellow
"Urban" = "#878787", # gray
"Water" = "#4393c3" # blue
)
palette <- unname(class_colors[classes])
# Convert to SpatRaster (terra) and assign class levels
raster_rf <- rast(raster_rf); levels(raster_rf) <- data.frame(value = 1:8, class = classes)
raster_xgb <- rast(raster_xgb); levels(raster_xgb) <- data.frame(value = 1:8, class = classes)
mapview(raster_rf, col.regions = palette, na.color = NA) +
mapview(training, zcol = "Class", col.regions = "black", alpha = 0.5)
mapview(raster_xgb, col.regions = palette, na.color = NA) +
mapview(training, zcol = "Class", col.regions = "black", alpha = 0.5)
Discussions:
The predicted thematic maps are overlaid with the training land cover
classes to visually validate the model performance. Based on the maps,
although there are slight misclassifications, both models show
highly accurate classification, as indicated by the confusion matrices and
scoring metrics. Of the two, Random Forest shows the better
performance in land cover classification.
Uncertainty analysis aims to quantify the reliability of predicted land cover maps derived from remote sensing data. It involves assessing and managing errors and uncertainties associated with the classification process, including those arising from data acquisition, processing, and the classification algorithms themselves.
1. Confusion matrices & Scoring metrics
Comparing the confusion matrices of the models verifies how each model performed in classifying the land cover types. The confusion matrices show the number of true positives, false positives, true negatives, and false negatives for each class. This allows the strengths and weaknesses of each model to be assessed in terms of classification accuracy.
# Create a list of confusion matrices for each model
cm_plot_rf
cm_plot_xgb
Compare the scoring metrics (precision, F1 score, recall, accuracy,
sensitivity, specificity) of the RF &
XGBoost models to see which model performed best overall.
The accuracy metric indicates the overall correctness of the model,
while sensitivity and specificity provide insights into how well the
model identifies each class.
# Create a data frame to hold the scoring metrics for each model from the test dataset
scoring_df <- data.frame(
Metric = c('Accuracy', 'Kappa', 'Precision', 'Sensitivity', 'Specificity','F1'),
RandomForest = c(confusion_rf$overall['Accuracy'],
confusion_rf$overall['Kappa'],
mean(confusion_rf$byClass[,"Precision"]),
mean(confusion_rf$byClass[,"Sensitivity"]),
mean(confusion_rf$byClass[,"Specificity"]),
mean(confusion_rf$byClass[,"F1"])),
XGBoost = c(confusion_xgb$overall['Accuracy'],
confusion_xgb$overall['Kappa'],
mean(confusion_xgb$byClass[,"Precision"]),
mean(confusion_xgb$byClass[,"Sensitivity"]),
mean(confusion_xgb$byClass[,"Specificity"]),
mean(confusion_xgb$byClass[,"F1"])))
# Print the scoring metrics data frame
kable(scoring_df, caption = "Scoring Metrics for Each Model") %>%
kable_styling(full_width = F, position = "left")
| Metric | RandomForest | XGBoost |
|---|---|---|
| Accuracy | 0.9949774 | 0.9927172 |
| Kappa | 0.9922706 | 0.9887915 |
| Precision | 0.9785630 | 0.9695125 |
| Sensitivity | 0.9769719 | 0.9686633 |
| Specificity | 0.9993490 | 0.9990357 |
| F1 | 0.9777535 | 0.9690795 |
# Convert to long format for plotting
scoring_df_long <- melt(scoring_df, id.vars = "Metric",
variable.name = "Model", value.name = "Score")
# Plot the scoring metrics for each model
ggplot(scoring_df_long, aes(x = Metric, y = Score, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Scoring Metrics Comparison between the models",
x = "Metric", y = "Score") +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Discussion - Model Performance:
Confusion matrices from Random Forest and
XGBoost are shown here. As per the
confusion matrices, the Random Forest model scores higher true
positives and fewer false positives than the
XGBoost model. Furthermore, the scoring metrics provide a
comprehensive overview of the performance of each model. The
Random Forest model achieved a higher accuracy than
XGBoost. The precision and sensitivity scores
indicate that Random Forest performed better at identifying the
gorse class than XGBoost. The F1
score, which balances precision and recall, also shows that
Random Forest outperformed XGBoost.
2. Probability-based Entropy Maps (for
uncertainty)
For this uncertainty analysis, the class probabilities are first
predicted for each model. From the predicted class probabilities,
the entropy per pixel is calculated, and these entropy values are then
mapped as raster data.
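Per-pixel entropy here is the Shannon entropy over the K = 8 class probabilities:

```latex
H(\mathbf{p}) = -\sum_{k=1}^{K} p_k \log_2 p_k
```

It ranges from 0 bits when all probability sits on a single class (full confidence) to log2(K) = 3 bits when all eight classes are equally likely (maximum uncertainty).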
# Preparing the data
# Convert the raster predictors to a dataframe (keep NA rows so that
# row numbers still correspond to raster cell numbers)
pred_data <- as.data.frame(predictors_stack)
valid_idx <- which(complete.cases(pred_data))
pred_data_valid <- pred_data[valid_idx, ]
# Predicting the model probabilities on the complete (valid) pixels only
prob_rf <- predict(model_rf, pred_data_valid, type='prob')
prob_xgb <- predict(model_xgb, pred_data_valid, type='prob')
# Computing the entropy for the models
compute_entropy <- function (p) {
epsilon <- 1e-12 # Avoid log 0
p <- pmax(pmin(p, 1 - epsilon), epsilon) # Clip values between epsilon and 1 - epsilon
-rowSums(p * log2(p))
}
# Apply the model probabilitites to the entropy function
entropy_rf <- compute_entropy(as.matrix(prob_rf))
entropy_xgb <- compute_entropy(as.matrix(prob_xgb))
# Setting up the plot
# Create templates for the model
entropy_raster_rf <- raster::raster(predictors_stack)
entropy_raster_xgb <- raster::raster(predictors_stack)
# Setting all values to NA
values(entropy_raster_rf) <- NA
values(entropy_raster_xgb) <- NA
# Inserting the entropy values at valid locations
values(entropy_raster_rf)[valid_idx] <- entropy_rf
values(entropy_raster_xgb)[valid_idx] <- entropy_xgb
# Convert raster to data frame for ggplot
# Random Forest
df_entropy_rf <- as.data.frame(entropy_raster_rf,
xy = TRUE, na.rm = TRUE)
colnames(df_entropy_rf)[3] <- "entropy"
# XGBoost
df_entropy_xgb <- as.data.frame(entropy_raster_xgb,
xy = TRUE, na.rm = TRUE)
colnames(df_entropy_xgb)[3] <- "entropy"
# Create the ggplot for Random Forest entropy
p_rf <- ggplot(df_entropy_rf, aes(x = x, y = y, fill = entropy)) +
geom_raster() +
scale_fill_viridis(option = "viridis") +
coord_equal() +
labs(title = "Entropy - Random Forest", fill = "Entropy") +
theme_minimal()
# Create the ggplot for XGBoost entropy
p_xgb <- ggplot(df_entropy_xgb, aes(x = x, y = y, fill = entropy)) +
geom_raster() +
scale_fill_viridis(option = "viridis") +
coord_equal() +
labs(title = "Entropy - XGBoost", fill = "Entropy") +
theme_minimal()
# Plot the entropy maps (ggplot objects print individually; par() does not apply to ggplot)
p_rf
p_xgb
Discussion
These entropy maps show the uncertainty of each model's predictions:
the more yellow, the higher the entropy; the more purple, the lower the entropy.
Low entropy signifies confidence in the predictions, whereas high entropy
indicates uncertainty and may mean similar probabilities across multiple
classes. In these maps, the RF entropy map shows more gradual transitions and
less extreme entropy, which suggests the model could be robust to noise
and less flexible. In contrast, XGBoost shows sharper contrasts and
more high-entropy areas, particularly at the class boundaries. This could
mean the model's classification is sensitive to the class boundaries, and
could also be a symptom of overfitting.
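As a sanity check on the entropy calculation, the two extremes can be verified directly (the compute_entropy function is restated here so the snippet is self-contained):

```r
compute_entropy <- function(p) {
  epsilon <- 1e-12                          # avoid log(0)
  p <- pmax(pmin(p, 1 - epsilon), epsilon)  # clip to (epsilon, 1 - epsilon)
  -rowSums(p * log2(p))
}

uniform <- matrix(rep(1/8, 8), nrow = 1)      # all 8 classes equally likely
certain <- matrix(c(1, rep(0, 7)), nrow = 1)  # all probability on one class
compute_entropy(uniform)  # 3 bits = log2(8), maximum uncertainty
compute_entropy(certain)  # ~0 bits, full confidence
```

This confirms the colour scale reading above: pixels near 3 bits are maximally ambiguous, pixels near 0 are confidently classified.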
3. DALEX Uncertainty Analysis
To gain insights into the uncertainty of the model predictions, the
DALEX (moDel Agnostic Language for Exploration and eXplanation) package
is used; it is designed to help understand and explain complex machine
learning models. DALEX allows the model diagnostics for each model to be
explored, making it possible to
analyze the models' predictions and understand how different features
contribute to the uncertainty in the predictions.
# DALEX - Prediction Diagnostics
# Define a safe predict function
predict_class <- function(m, d) predict(m, d, type = "raw")
# Create explainers for each model using DALEX
explainer_rf <- DALEX::explain(model_rf,
data = testDat[, predictors],
y = testDat$Class,
label = "Random Forest")
## Preparation of a new explainer is initiated
## -> model label : Random Forest
## -> data : 7964 rows 9 cols
## -> target variable : 7964 values
## -> predict function : yhat.train will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package caret , ver. 7.0.1 , task multiclass ( default )
## -> predicted values : predict function returns multiple columns: 8 ( default )
## -> residual function : difference between 1 and probability of true class ( default )
## -> residuals : numerical, min = 0 , mean = 0.01093245 , max = 0.996
## A new explainer has been created!
explainer_xgb <- DALEX::explain(model_xgb,
data = testDat[, predictors],
y = testDat$Class,
label = "XGBoost")
## Preparation of a new explainer is initiated
## -> model label : XGBoost
## -> data : 7964 rows 9 cols
## -> target variable : 7964 values
## -> predict function : yhat.train will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package caret , ver. 7.0.1 , task multiclass ( default )
## -> predicted values : predict function returns multiple columns: 8 ( default )
## -> residual function : difference between 1 and probability of true class ( default )
## -> residuals : numerical, min = 1.502037e-05 , mean = 0.007651202 , max = 0.9999674
## A new explainer has been created!
# Take one observation from the test set
set.seed(123)
new_obs <- testDat[1, predictors]
# Compute diagnostics for each model
diag_rf <- predict_diagnostics(explainer_rf,
new_observation = new_obs, N=NULL)
diag_xgb <- predict_diagnostics(explainer_xgb,
new_observation = new_obs, N=NULL)
# Plotting uncertainty histograms
hist_rf <- rbind(diag_rf$histogram_neighbors, diag_rf$histogram_all)
hist_rf$model <- "Random Forest"
hist_xgb <- rbind(diag_xgb$histogram_neighbors, diag_xgb$histogram_all)
hist_xgb$model <- "XGBoost"
# Combine all
combined_hist <- rbind(hist_rf, hist_xgb)
# Plotting the histogram
ggplot(combined_hist, aes(x = Var1, y = Freq, fill = direction)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~model, scales = "free_y") +
scale_fill_manual(values = c("neighbors" = "steelblue", "all" = "tomato")) +
theme_minimal() +
labs(title = "Similarity to Training Data",
x = "Distance Bin",
y = "Frequency",
fill = "Compared To") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5))
Discussion:
Both models show a sharp spike at very low distance bins, which
means that the test dataset is highly similar to the training dataset,
particularly for XGBoost. This suggests that both models rely on very
similar training data to make predictions. In the RF model, the
red bars are all negative, which shows that the test data as a whole is
less similar to the entire training data distribution. This suggests
that the RF model has some overfitting and has learned mostly from local
structures. Across both plots, RF is more spread across the bins,
whereas XGBoost has a sharp spike at the lowest bin. This shows that
XGBoost relies heavily on a few highly similar instances, whereas RF
spreads its attention more evenly.
The project aim is to predict the invasive species gorse
(Ulex europaeus) on the Banks Peninsula in New
Zealand. Based on this project goal, using the Sentinel satellite images
and plotted land cover class shapes of the study area, three
machine learning models (Random Forest (RF),
XGBoost & Support Vector Machine (SVM)) were
selected and trained with the same training control method (10-fold
cross-validation). Of the three models,
RF and XGBoost perform very well, while the
SVM model is not up to par in classifying the land cover
classes using the satellite images.
Based on the various analyses, such as confusion matrices, scoring metrics,
predictive & entropy maps, and uncertainty diagnostics, the
RF model has low uncertainty and predicts the land cover
classes with the highest accuracy of 99.50%, compared to
XGBoost. In conclusion, the Random Forest model
outperforms XGBoost in predicting the invasive species
gorse (Ulex europaeus), using Sentinel
satellite data, on the Banks Peninsula of New Zealand.